iRefIndex is a collection of protein interaction databases that provides an index of canonical interaction pairs, along with references to the databases providing evidence for each interaction. The purpose of this notebook is to extract a binary feature for each database integrated into iRefIndex. These databases are:

BIND
BioGRID
CORUM
DIP
HPRD
InnateDB
IntAct
MatrixDB
MINT
MPact
MPIDB
MPPI
OPHID

To extract these features we will iterate over the table and use each pair of Entrez Gene IDs as a key, recording the source database of every entry that refers to that pair:



In [1]:

    
cd ../../iRefIndex/









    



/home/gavin/Documents/MRes/iRefIndex



In [4]:

    
import csv



In [13]:

    
import pdb



In [24]:

    
irefindexdict = {}
with open("9606.mitab.08122013.txt") as f:
    c = csv.reader(f, delimiter="\t")
    for l in c:
        # extract Entrez Gene IDs from the two identifier columns read here (l[2] and l[3])
        gids = []
        for x in [l[2], l[3]]:
            for s in x.split("|"):
                s = s.split(":")
                if s[0] == "entrezgene/locuslink":
                    gids.append(s[1])
        # only add the entry if exactly two Gene IDs were found;
        # the frozenset key makes the pair independent of order
        if len(gids) == 2:
            # l[12] holds the source database label, e.g. 'MI:0465(dip)'
            irefindexdict.setdefault(frozenset(gids), []).append(l[12])
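
To make the parsing step above concrete, here is a minimal sketch that applies the same logic to a few made-up rows. The helper name `entrez_ids`, the toy rows and the identifiers in them are assumptions for illustration only; they are not taken from the iRefIndex file.

```python
def entrez_ids(id_field):
    """Return the Entrez Gene IDs found in one '|'-separated identifier field."""
    ids = []
    for entry in id_field.split("|"):
        namespace, _, identifier = entry.partition(":")
        if namespace == "entrezgene/locuslink":
            ids.append(identifier)
    return ids

# Made-up rows reduced to the three fields the loop reads:
# (identifiers for A, identifiers for B, source database label).
toy_rows = [
    ("uniprotkb:P00001|entrezgene/locuslink:111", "entrezgene/locuslink:222", "MI:0465(dip)"),
    ("entrezgene/locuslink:222", "entrezgene/locuslink:111|refseq:NP_000001", "MI:0469(intact)"),
    ("uniprotkb:P00002", "entrezgene/locuslink:333", "MI:0463(biogrid)"),  # only one Gene ID
]

toy_dict = {}
for ids_a, ids_b, source in toy_rows:
    gids = entrez_ids(ids_a) + entrez_ids(ids_b)
    # keep the row only when exactly two Gene IDs were found
    if len(gids) == 2:
        toy_dict.setdefault(frozenset(gids), []).append(source)

print(toy_dict)
```

The first two rows describe the same pair in opposite order, so they end up under a single `frozenset` key with both source labels; the third row is dropped because only one Gene ID is found.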

Now we find the strings corresponding to unique databases:



In [26]:

    
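from itertools import chain
# flatten the per-pair lists of source labels and keep each label once
uniqdbs = list(set(chain.from_iterable(irefindexdict.values())))
print(uniqdbs)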









    



['MI:0465(dip)', 'MI:0469(intact)', 'MI:0463(biogrid)', 'MI:0468(hprd)', 'MI:0000(corum)', 'MI:0000(mppi)', 'MI:0462(bind)', 'MI:0917(matrixdb)', 'MI:0000(bind_translation)', 'MI:0000(ophid)', 'MI:0974(innatedb)']

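Using these labels we can build a second dictionary with the same keys as above, mapping each pair to a binary vector with one element per database (a 1-of-k style coding in which several elements may be set, since a pair can be reported by more than one database):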



In [27]:

    
ireffeaturedict = {}
for k in irefindexdict.keys():
    fvector = []
    for db in uniqdbs:
        if db in irefindexdict[k]:
            fvector.append("1")
        else:
            fvector.append("0")
    ireffeaturedict[k] = fvector
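
As a small worked example of the coding above, the following sketch encodes two made-up pairs against three database labels. The names `encode`, `toy_dbs` and `toy_sources` are hypothetical, and the Gene IDs are illustrative only.

```python
# Column order for the feature vector (a subset of the labels printed above).
toy_dbs = ["MI:0465(dip)", "MI:0469(intact)", "MI:0463(biogrid)"]

# Made-up pairs with the source labels collected for each of them.
toy_sources = {
    frozenset(["111", "222"]): ["MI:0465(dip)", "MI:0469(intact)"],
    frozenset(["333"]): ["MI:0463(biogrid)", "MI:0463(biogrid)"],  # repeated evidence
}

def encode(sources, databases):
    """Return one '1'/'0' string per database, in the order of `databases`."""
    present = set(sources)
    return ["1" if db in present else "0" for db in databases]

toy_features = {pair: encode(srcs, toy_dbs) for pair, srcs in toy_sources.items()}

for pair, vector in toy_features.items():
    print(sorted(pair), vector)
```

The first pair gets two elements set because two databases report it, and repeated evidence from the same database still yields a single `"1"`.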

Saving the results

These results will be saved in two ways:

First, the results will be written to a tab-delimited file using the above unique database identifiers as column labels.
Second, the dictionary will be wrapped in a class written specifically for iRefIndex and pickled, so that it can be loaded later to build feature vectors.



In [29]:

    
f = open("human.iRefIndex.Entrez.1ofk.txt", "w")
c = csv.writer(f,delimiter="\t")
c.writerow(["protein1","protein2"]+uniqdbs)
for k in ireffeaturedict.keys():
    pair = list(k)
    # a self-interaction collapses to a one-element frozenset; repeat the ID to fill both columns
    if len(pair) == 1:
        pair = pair*2
    c.writerow(pair + ireffeaturedict[k])
f.close()
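
For completeness, here is a sketch of how a file in this layout could be read back into a dictionary keyed by `frozenset`. It reads from an in-memory stand-in with made-up rows rather than the real output file, and the names `toy_file`, `columns` and `loaded` are assumptions.

```python
import csv
import io

# In-memory stand-in for the saved file; the rows are made up.
toy_file = io.StringIO(
    "protein1\tprotein2\tMI:0465(dip)\tMI:0469(intact)\n"
    "111\t222\t1\t0\n"
    "333\t333\t0\t1\n"
)

reader = csv.reader(toy_file, delimiter="\t")
header = next(reader)
columns = header[2:]  # database labels, in feature order

loaded = {}
for row in reader:
    # a self-interaction written as "333 333" collapses back to a one-element key
    loaded[frozenset(row[:2])] = row[2:]

print(columns)
print(loaded)
```

Because the key is a `frozenset`, a lookup does not depend on which protein was written in the first column.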



In [30]:

    
!head human.iRefIndex.Entrez.1ofk.txt



In [31]:

    
import sys



In [32]:

    
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")



In [35]:

    
import ocbio.irefindex



In [37]:

    
features = ocbio.irefindex.features(ireffeaturedict)



In [38]:

    
import pickle



In [39]:

    
f = open("human.iRefIndex.Entrez.1ofk.pickle","wb")
pickle.dump(features,f)
f.close()
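
The pickle can later be loaded with `pickle.load`. A minimal round-trip sketch follows, using a plain dictionary and a temporary file as stand-ins; the object and path here are assumptions, not the `ocbio.irefindex.features` instance saved above. Note that unpickling an instance of a custom class requires its defining module (here `ocbio.irefindex`) to be importable in the loading session.

```python
import os
import pickle
import tempfile

# Stand-in for the feature object: a plain dict with a made-up pair.
toy_features = {frozenset(["111", "222"]): ["1", "0"]}

# Write to a temporary location rather than the real output file.
path = os.path.join(tempfile.mkdtemp(), "toy.pickle")
with open(path, "wb") as fh:
    pickle.dump(toy_features, fh)

# Load it back; the file must be opened in binary mode.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == toy_features)
```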